tasksource: A Dataset Harmonization Framework for Streamlined NLP Multi-Task Learning and Evaluation

Sileo, Damien

arXiv:2301.05948 (cs)

[Submitted on 14 Jan 2023 (v1), last revised 16 May 2023 (this version, v3)]

Abstract: The HuggingFace Datasets Hub hosts thousands of datasets, offering exciting opportunities for language model training and evaluation. However, datasets for a specific task type often have different schemas, making harmonization challenging. Multi-task training or evaluation necessitates manual work to fit data into task templates. Several initiatives independently tackle this issue by releasing harmonized datasets or providing harmonization code to preprocess datasets into a consistent format. We identify patterns across previous preprocessing efforts, such as column name mapping and extracting specific sub-fields from structured data in a column. We then propose a structured annotation framework that ensures our annotations are fully exposed and not hidden within unstructured code. We release a dataset annotation framework and dataset annotations for more than 500 English tasks. These annotations include metadata, such as the names of columns to be used as input or labels for all datasets, which can save time for future dataset preprocessing, regardless of whether our framework is utilized. We fine-tune a multi-task text encoder on all tasksource tasks, outperforming every publicly available text encoder of comparable size in an external evaluation.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
ACM classes:	I.2.7
Cite as:	arXiv:2301.05948 [cs.CL]
	(or arXiv:2301.05948v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2301.05948

Submission history

From: Damien Sileo
[v1] Sat, 14 Jan 2023 16:38:04 UTC (88 KB)
[v2] Fri, 10 Feb 2023 09:35:32 UTC (89 KB)
[v3] Tue, 16 May 2023 08:19:44 UTC (181 KB)
